ABSTRACT

A new, low complexity method facilitates low burden embedding
and recovery of tonal watermarks in speech. A watermark composed
of a periodically extended sequence of sub-audible DTMF tones is
added to speech asynchronously, without regard to momentary speech
characteristics. It is detected through a combination of a bit
manipulation enhancement and a data-directed correlation, ideal for
simple hardware implementations. Three methods of bit manipulation
enhancement were auditioned and the best selected for further
investigation. It showed an average 26 dB processing gain vs.
correlation alone, sufficient to detect the asynchronous sub-audible
tones by a comfortable margin.

Index  Terms —  Speech  Watermarking,  Hidden  Tones, 
Speech  Steganography,  Speech  Data  Hiding 

1.  BACKGROUND 

Imperceptibly embedded data can be used to stamp speech with a
watermark. In many applications the watermark must be transparent
to the listener of the speech content, and should not rob any power
from the signal or affect its content by noticeably changing the
speech power level or its intelligibility. Additionally, it would be
ideal to minimize any delay, processing load, or system modification
burden at the point of watermark generation and insertion. It would
also be desirable to have a low complexity recovery method.

Prior researchers' approaches have included directly replacing the
lower bits in PCM samples [1], replacing the unvoiced CELP residual
[2], impressing coded phase changes onto the analog waveform, hiding
spread spectrum under formants [3], and inserting short tones at
frame by frame computed levels [4].

Many of those approaches tried to minimize the difficulty in
watermark recovery by maximizing the watermark power. That was done
by inserting data piecemeal at higher power levels, skirting the
threshold of hearing and the limits of perceptual masking. These
methods attempt to mask data by inserting it only into certain
strongly voiced speech segments, or by inserting it all throughout
speech, but at custom power ratios calculated for each short segment.
These approaches require processing buffer delays that preclude
real-time, instantaneous encoding. They also require considerable
processing load, both at the insertion stage and at the recovery.

2.  INTRODUCTION 

The proposed new method allows instantaneous encoding through a
simple mixing of DTMF tones. It adds the tones asynchronously,
without any knowledge of the momentary speech details, or of any
piecemeal speech/data power relationships.

Human  perception  is  quite  sensitive  to  tones,  particularly 
in  very  clean  speech,  so  they  must  be  inserted  at  a  very  low 
level,  making  recovery  extremely  difficult.  Informal  listening 
found  the  tones  inaudible  at  a  roughly  -50  dB  power  level. 

The new recovery method has two components: preprocessing by bit
manipulations, and a data-directed correlation. This paper compares
the detection by correlation alone to that after enhancement by a
low complexity method.

An extra benefit of this scheme is that the calculation and analysis
load is borne essentially by the detection/recovery process, with
minimal burden at the encoding end. That also means that minimal
technical equipment changes are needed to add watermarks, and that
significant changes are required only for those interested in
detecting or decoding the watermark.

2.1.  Watermark  Embedding 

Assume that a watermark signal is scaled and added to a truncated
speech signal

$y = s + \lambda w \in \mathbb{I}_{16}^{N \times 1}$  (1)

where $s \in \mathbb{I}_{16}^{N \times 1}$ is the speech signal
represented as a 16-bit signed integer code, $\lambda \in \mathbb{R}$
is a scaling factor, and $w \in \mathbb{I}_{16}^{N \times 1}$ is the
watermark. In general, $\lambda$ is independent of $s$. When the
speech signal is available, the value for $\lambda$ may be calculated

$\lambda = \left[ \frac{\sum_{n=1}^{N} (s_n)^2}{\sum_{n=1}^{N} (w_n)^2} \right]^{1/2} 10^{r/20}$

where $s_n$ and $w_n$ are the components of the speech and watermark
signal, and $r$ is a desired watermark to speech power ratio in dB.
If the speech signal is not available, the value of $\lambda$ can be
determined by an arbitrary estimate of the power of an average speech
signal.
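
The scaling step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the exponent in the scale factor is r/20 (so that the power ratio of the scaled watermark to the speech equals r dB), and the function names, the random stand-in for speech, and the 697 Hz test tone are hypothetical choices made only for the example.

```python
import numpy as np

def watermark_scale(s, w, r_db):
    """Scale factor lambda placing lambda*w at r_db dB relative to s in power."""
    s = np.asarray(s, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    return np.sqrt(np.sum(s**2) / np.sum(w**2)) * 10.0 ** (r_db / 20.0)

def embed(s, w, r_db):
    """Return y = s + lambda*w rounded back to 16-bit signed integers."""
    lam = watermark_scale(s, w, r_db)
    y = np.asarray(s, dtype=np.float64) + lam * np.asarray(w, dtype=np.float64)
    # Rounding and clipping to the int16 range is an assumption of this sketch.
    return np.clip(np.rint(y), -32768, 32767).astype(np.int16), lam

rng = np.random.default_rng(0)
s = rng.integers(-8000, 8000, size=16000)                 # stand-in for speech samples
w = np.sin(2 * np.pi * 697.0 * np.arange(16000) / 16000)  # stand-in watermark
y, lam = embed(s, w, -50.0)

# Power ratio actually achieved by lambda*w relative to s, in dB.
achieved_db = 10 * np.log10(np.sum((lam * w) ** 2) / np.sum(s.astype(np.float64) ** 2))
```

By construction `achieved_db` equals the requested ratio before rounding to integers; the rounding itself perturbs the embedded watermark slightly.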

In the experiments which follow, the watermark signal $w$ was derived
from a sequence of $P$ DTMF tones

$\theta_P = [d_1, \ldots, d_P]$  (2)

where each DTMF tone $d_i \in \mathbb{I}_{16}^{K}$ had a duration of
100 milliseconds (i.e. $K = f_s/10$, for a sample rate of $f_s$).
Since there are 16 available DTMF tones, a total of $16^P$ unique
DTMF sequences could be generated. The watermark

$w = [\theta_P^{(1)}, \ldots, \theta_P^{(q)}]^T$  (3)

was then constructed by repeating $\theta_P$ until the length of the
watermark ($qKP$) was equal to the number of samples in $s$. Note
that the original speech signal was truncated to $s$, a segment whose
length is a multiple of $KP$, to match the DTMF sequence.
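
The construction in Eqs. (2) and (3) can be sketched as below. The row and column frequencies are the standard DTMF keypad values; everything else (function names, unscaled floating-point tones instead of 16-bit integer codes, the example speech length) is an assumption made for illustration.

```python
import numpy as np

# Standard DTMF keypad: each key is the sum of one row and one column tone (Hz).
ROW_HZ = (697.0, 770.0, 852.0, 941.0)
COL_HZ = (1209.0, 1336.0, 1477.0, 1633.0)
KEYPAD = ("123A", "456B", "789C", "*0#D")

def dtmf_tone(key, fs=16000, duration=0.1):
    """One DTMF tone d_i of K = fs * duration samples (float, unscaled)."""
    for r, row in enumerate(KEYPAD):
        if len(key) == 1 and key in row:
            c = row.index(key)
            t = np.arange(int(round(fs * duration))) / fs
            return np.sin(2 * np.pi * ROW_HZ[r] * t) + np.sin(2 * np.pi * COL_HZ[c] * t)
    raise ValueError(f"not a DTMF key: {key!r}")

def build_watermark(keys, n_speech_samples, fs=16000):
    """Repeat theta_P = [d_1, ..., d_P] a whole number of times (q repetitions)."""
    theta = np.concatenate([dtmf_tone(k, fs) for k in keys])  # length K*P
    q = n_speech_samples // theta.size                        # speech is truncated to q*K*P
    return np.tile(theta, q)

# Ten 100 ms tones give a 1-sec sequence; a 30-sec-plus signal keeps q = 30 copies.
w = build_watermark("123456789A", 30 * 16000 + 123)
```

The speech would then be truncated to `w.size` samples before the scaled watermark is added.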


2.2. Correlation Analysis

The true cross-correlation sequence between the watermark and the
speech is

$R_{wy}(m) = E[w_{n+m} y_n]$  (4)

where $w_n$ and $y_n$ are stationary random processes representing
the watermark and speech plus watermark respectively,
$-\infty < n < \infty$, and $E[\cdot]$ is the expectation operator.
Assuming that $w$ and $s$ are independent and that either the
expected value of the watermark or the speech is zero, using Eq. (1)
the cross-correlation

$R_{wy}(m) = E[w_{n+m}] E[s_n] + \lambda E[w_{n+m} w_n]$
$= \lambda E[w_{n+m} w_n] = \lambda R_{ww}(m)$

is equal to a constant times the autocorrelation of the watermark
signal.

3. ANALYSIS OF RECOVERY METHODS

3.1. Preprocessing by Bit Manipulation

In practical application, a sample mean is used to estimate the
expectation operator in Eq. (4):

$E[w_{n+m} y_n] \approx M_N(w_{n+m} y_n) = \lambda R_{ww}(m) + e,$  (5)

where $e$ is the estimation error that results from substituting $E$
with $M_N(w_{n+m} y_n) = \frac{1}{N} \sum_{n=1}^{N} w_{n+m} y_n$.

Since

$M_N(w_{n+m} y_n) = M_N(w_{n+m}(s_n + \lambda w_n))$
$= M_N(w_{n+m} s_n) + \lambda M_N(w_{n+m} w_n),$

we see that $e = e_1 + e_2$, with
$e_1 = M_N(w_{n+m} s_n) - E[w_{n+m} s_n]$ and
$e_2 = \lambda (M_N(w_{n+m} w_n) - E[w_{n+m} w_n])$. The law of large
numbers states that $\sigma_{e_1}^2 = \sigma_{w_{n+m} s_n}^2 / N$ and
$\sigma_{e_2}^2 = \lambda^2 \sigma_{w_{n+m} w_n}^2 / N$, and since the
watermark signal $\lambda w_n$ is intentionally many decibels below
speech $s_n$ in power level we may presume that
$\sigma_e^2 \approx \sigma_{e_1}^2$. Therefore once the waveforms and
parameters $\{w_n, s_n, \lambda, N\}$ are selected one may attempt to
reduce $\sigma_e^2$ by reducing the variance of $w_{n+m} s_n$ through
some kind of nonlinear processing prior to correlation.

Our work has been to apply three different instantaneous
nonlinearities to the watermarked speech, $y_n = s_n + \lambda w_n$,
in order to improve the resulting estimate of the autocorrelation
function $R_{ww}(m)$. To ensure computational efficiency, each of the
three nonlinear preprocessing methods is shown below to have a simple
implementation using bit-level manipulations on signed integer (also
known as 2's-complement) binary codes.

The first method that we have investigated for improving watermark in
speech recovery we have called the REM method. It gets its name from
the remainder function that defines it as

$\mathrm{REM}(y_n, k) = \mathrm{rem}(y_n + \tfrac{1}{2}, 2^k) - \tfrac{1}{2}.$

With signed integer codes, the REM method is implemented as follows:
retain the $k$ least-significant bits without any change, and replace
all other bits with copies of the sign bit. The second method is an
amplitude limiting process

$\mathrm{AL}(y_n, k) = \mathrm{sign}(y_n) \cdot \min(|y_n|, 2^k).$

With signed integer codes the AL method is implemented as follows: if
all except the $k$ right-most bits are the same in value, then make no
change. Otherwise clear the $k$ right-most bits, set the bit to their
left, and replace all other bits with copies of the sign bit. Finally,
the third of our processing methods is the SIGN method:

$\mathrm{SIGN}(y_n) = [y_n \ge 0] - 1,$

where the test for $y_n \ge 0$ returns 1 if true and 0 if false. When
applied on signed integer codes, all bits are replaced with copies of
the sign bit. It should be noted that both the SIGN and REM methods
introduce a d.c. bias that may be subtracted if desired.
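
The three bit-level descriptions above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes a 32-bit working width for the 16-bit samples, assumes the REM closed form is rem(y_n + 1/2, 2^k) - 1/2 with a remainder that truncates toward zero (NumPy's `fmod`), and uses hypothetical function names.

```python
import numpy as np

def rem_method(y, k):
    """REM: keep the k least-significant bits, fill all higher bits with the sign bit."""
    y = np.asarray(y, dtype=np.int32)
    mask = np.int32((1 << k) - 1)        # ones in the k right-most positions
    low = y & mask
    return np.where(y < 0, low | ~mask, low)

def al_method(y, k):
    """AL: leave in-range samples alone, otherwise limit to +/- 2**k."""
    y = np.asarray(y, dtype=np.int32)
    upper = y >> k                       # arithmetic shift: bits above the k right-most
    all_same = (upper == 0) | (upper == -1)
    limit = np.int32(1 << k)             # k right-most bits clear, next bit set
    return np.where(all_same, y, np.where(y < 0, -limit, limit))

def sign_method(y):
    """SIGN: every bit becomes a copy of the sign bit (0 or -1)."""
    y = np.asarray(y, dtype=np.int32)
    return y >> 31

# Compare the bit-level versions with the closed-form definitions for k = 10.
y = np.arange(-32768, 32768, dtype=np.int32)
k = 10
rem_matches = np.array_equal(rem_method(y, k), np.fmod(y + 0.5, 2.0**k) - 0.5)
al_matches = np.array_equal(al_method(y, k), np.sign(y) * np.minimum(np.abs(y), 2**k))
sign_matches = np.array_equal(sign_method(y), (y >= 0).astype(np.int32) - 1)
```

Each operation touches one sample at a time with masks and shifts only, which is what makes the preprocessing attractive for simple hardware.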

Figure 1 shows the relative processing gain resulting from all three
methods on a sum of a zero-mean, Gaussian random watermark when scaled
to be 50 dB below a 100 Hz-tone model for speech ($N = 10^6$). We have
found the nonlinear processing effectiveness in improving output SNR
to be very much signal-dependent. The plot shows an experimental
result where the REM method with parameter $k = 10$ achieved in excess
of 25 dB processing gain compared to cross-correlation without any
nonlinear preprocessing.

Fig. 1. Processing gain while comparing SIGN, REM, and AL methods.


3.2.  Data-Directed  Watermark  Detection 

The data-directed correlation detection method along with a threshold,
$\alpha$, provides a test to determine whether the watermark is
present in the speech signal. Using a modified correlation, the method
returns a score in the continuous range from 0 to 5, where a higher
value indicates a higher level of detection confidence.

The Correlation Detection Score (CDS) is a measure of the quality of
the cross-correlation between $w$ and $y$ as compared to the
autocorrelation of the watermark $w$. When the error $e$ is small (see
Eq. 5), it is expected that $R_{wy}$ will be close to the scaled
autocorrelation of the watermark. Therefore, an objective measure was
derived which determines how well $R_{wy}$ matches the scaled
autocorrelation $\lambda R_{ww}$, which is known a priori.

Since the reference correlation $R_{ww}$ is an even function, the
information in the left and right halves is equivalent. Therefore only
the coefficients in the left half

$c_{wy}(m) = R_{wy}(m - N + KP/2), \quad m = 1, \ldots, N$

were considered in the scoring function. Note that the coefficients
are shifted to the right by half of the length of $\theta_P$ so that
windowing can be centered around each correlation peak. Finally, the
correlation is squared and normalized to produce the correlation
sequence

$\hat{c}_{wy}(m) = \frac{c_{wy}(m)^2}{\max_{1 \le k \le N} (c_{wy}(k)^2)}, \quad m = 1, \ldots, N$

which becomes independent of $\lambda$ because of the normalization.
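
A minimal sketch of the shifted, squared, and normalized correlation sequence follows. It assumes the sample correlation is computed with `numpy.correlate` (whose "full" output index i corresponds to lag i - (N - 1)), drops the 1/N factor because the normalization cancels it, and uses 0-based array positions j = m - 1; the function name is hypothetical.

```python
import numpy as np

def normalized_sq_correlation(w, y, kp):
    """Left-half correlation shifted right by KP/2, then squared and peak-normalized.

    Entry j (0-based, j = m - 1) holds the squared correlation at lag
    m - N + KP/2, divided by the largest squared value in the sequence.
    """
    w = np.asarray(w, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    n = y.size
    # full[i] = sum_n w[n + lag] * y[n], with lag = i - (n - 1)
    full = np.correlate(w, y, mode="full")
    start = kp // 2                       # lag (1 - N + KP/2) sits at this index
    c = full[start:start + n]
    c_sq = c ** 2
    return c_sq / c_sq.max()

# With y equal to the watermark itself, the zero-lag peak lands KP/2 from the end.
rng = np.random.default_rng(1)
w_demo = rng.standard_normal(64)
c_hat = normalized_sq_correlation(w_demo, w_demo, kp=16)
zero_lag_index = int(np.argmax(c_hat))
```

Direct correlation costs on the order of N^2 operations, so a practical detector for long signals would likely use an FFT-based correlation instead.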

Define $i_1, \ldots, i_q$ to be the $q$ peak indices of the
autocorrelation sequence $c_{ww}(m)$, $m = 1, \ldots, N$,
corresponding to when the individual watermarks ($\theta_P$) align
with each other. First the raw score

$r_{wy}(i_j) = \begin{cases} 1 & \text{if } i_j = \arg\max_{\{i_j - KP/4 < m < i_j + KP/4\}} \hat{c}_{wy}(m) \\ 0 & \text{otherwise} \end{cases}$

was determined for each of the $q$ autocorrelation peaks, using the
squared and normalized sequence $\hat{c}_{wy}(m)$. The correlation
detection score is then calculated as

$S_{wy} = \beta \sum_{j=1}^{q} r_{wy}(i_j) \, c_{ww}(i_j)$

where the amplitudes of the peaks $c_{ww}(i_j)$ are used as weighting
factors and $\beta = 5 / \sum_{j=1}^{q} c_{ww}(i_j)$ scales the score
between 0 and 5. Since the peak amplitudes follow a triangular shape
(see Figure 2a) the weights were designed to reward the higher valued
peaks which are less likely to be dominated by adjacent noise.
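
The scoring rule can be sketched as below. Several details are assumptions of this sketch rather than confirmed specifics of the method: the raw score is taken to be 1 for a matching peak and 0 otherwise, the window is the open interval of half-width KP/4 around each reference peak, the scale factor is 5 divided by the sum of the reference peak amplitudes, and indices are 0-based. The function name and the toy sequences are hypothetical.

```python
import numpy as np

def correlation_detection_score(c_wy, c_ww, peaks, kp):
    """Weighted count of reference peaks that are also the local maximum of c_wy."""
    c_wy = np.asarray(c_wy, dtype=np.float64)
    c_ww = np.asarray(c_ww, dtype=np.float64)
    half = kp // 4
    weights = c_ww[peaks]                     # reference peak amplitudes
    hits = np.zeros(len(peaks))
    for j, ij in enumerate(peaks):
        lo = max(ij - half + 1, 0)            # open window: ij - half < m < ij + half
        hi = min(ij + half, c_wy.size)
        best = lo + int(np.argmax(c_wy[lo:hi]))
        hits[j] = 1.0 if best == ij else 0.0
    beta = 5.0 / weights.sum()                # perfect agreement scores 5
    return float(beta * np.sum(hits * weights))

# Toy reference with three peaks of rising amplitude (triangular envelope).
peaks = [10, 30, 50]
c_ww = np.zeros(60)
c_ww[peaks] = [0.5, 0.75, 1.0]

score_match = correlation_detection_score(c_ww, c_ww, peaks, kp=40)

# Displace the weakest peak by two samples so it no longer matches.
c_shifted = c_ww.copy()
c_shifted[10], c_shifted[12] = 0.0, 0.5
score_partial = correlation_detection_score(c_shifted, c_ww, peaks, kp=40)
```

A detection decision then compares the score with the chosen threshold.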


(b)  cwy  with  6  matching  peak  locations. 


Fig.  2.  Determining  the  Correlation  Detection  Score  of 
speech  with  a  -30  dB  watermark. 

A cross-correlation sequence $c_{wy}$ between the watermark and $y$,
illustrated in Figure 2b, is evaluated by comparing the constrained
peak locations with the corresponding peak locations of the
autocorrelation sequence $c_{ww}$ shown in Figure 2a. The broken lines
indicate the constraint placed on each peak and the circles at the
peaks of $c_{wy}$ indicate when the highest peak within each window
matches the corresponding peak location of $c_{ww}$. In this case,
only six peaks matched, giving a correlation detection score
$S_{wy} = 4.7544$.

4.  EXPERIMENTAL  RESULTS 

The following sections demonstrate performance of the REM, AL, and
SIGN enhancement methods using 16 kHz clean speech and a watermark
created from a sequence of DTMF tones described earlier in Section
2.1. For each experiment, a 1-sec DTMF sequence was created (see
Eq. 2) using the tones from the ten digit sequence "123456789A", and
added to each speech segment by repetition via the construction in
Eq. (3).

4.1. REM, AL, and SIGN Methods with Speech


A male speaker from the TIMIT database was selected at random and the
speech from his ten utterances was concatenated up to a total duration
of 30 sec. After the DTMF watermark was added at a varying signal to
noise ratio, the CDS was determined as the value of parameter $k$ was
modified. The results for the REM and AL methods appear in Figures 3a
and 3b. The performance of both is similar: decreasing $k$ enables one
to detect a weaker watermark signal. The SIGN method exceeds or equals
the performance of the other two methods for every value of $k$ (30 dB
gain compared with no enhancement). Note, when $k = 0$:
$\mathrm{REM}(y_n, k) = \mathrm{SIGN}(y_n)$, and $\mathrm{AL}(y_n, k)$
differs from the other two only for $y_n = 0$ (when the three
nonlinear functions are normalized to have the same amplitude range).
Because of this, the SIGN method was chosen for further investigation.

(a) REM  (b) AL

Fig. 3. Correlation Detection Score (CDS) as watermark dB level and
number of bits are varied.

4.2.  SIGN  Method  with  Multiple  Speakers 

To demonstrate the improvement over a wider range of speech samples,
performance was evaluated for 20 randomly selected male TIMIT
speakers. Utterances from each speaker were concatenated and the total
speech duration per speaker was used to generate progressively longer
speech segments $s_2^i, s_4^i, \ldots, s_{24}^i$ where the subscript
indicates the duration in seconds and $i$ is the speaker ID. The 1-sec
DTMF sequence was added to each $s_j^i$ by repetition.

The lowest detection level (using a threshold of $\alpha = 2$) was
calculated for each speaker segment $s_j^i$, $j = 2, 4, \ldots, 24$;
$i = 1, \ldots, 20$. The mean, over the speakers, is plotted in Figure
4a as the durations are increased. The upper line in Figure 4a shows
the lowest detection level without enhancement, the broken line
approximates the human detection threshold, and the lower line shows
an average of 26 dB improvement after enhancement. The vertical lines
at each data point indicate the range of plus or minus $\sigma$ among
the 20 TIMIT speakers.

Also seen in Figure 4a is that as the speech segment duration doubles,
the SNR detection level gains approximately the expected 3 dB.
However, the last 5 samples of the enhanced plot line indicate that an
asymptote is reached at near -60 dB.


(a)  Twenty  speakers.  (b)  Single  speaker. 


Fig.  4.  Evaluation  of  SIGN  method. 

This can be explained because the SIGN method requires that the ratio

$\gamma = \frac{1}{N} \sum_{n=1}^{N} [\mathrm{SIGN}(y_n) == \mathrm{SIGN}(\lambda w_n)]$

must not represent random chance. Varying the watermark dB level on a
single 30-sec TIMIT speech file (Figure 4b), it can be seen that as
the signal to noise ratio is reduced $\gamma$ approaches 0.5. Also
note that the corresponding score drops to zero near the input SNR
level where $\gamma$ reaches the asymptote.
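
The sign-agreement ratio discussed above can be sketched as below. The example is illustrative only: it uses Gaussian noise as a stand-in for speech (an assumption, since real speech has different statistics), a watermark scaled roughly 60 dB below it, and hypothetical function names.

```python
import numpy as np

def sign_code(x):
    """SIGN nonlinearity: 0 for x >= 0 and -1 for x < 0."""
    return (np.asarray(x) >= 0).astype(np.int8) - 1

def sign_agreement(y, scaled_w):
    """Fraction of samples whose SIGN code matches that of the scaled watermark."""
    return float(np.mean(sign_code(y) == sign_code(scaled_w)))

rng = np.random.default_rng(2)
n = 100_000
s = rng.normal(0.0, 1000.0, size=n)   # stand-in for speech (assumption)
w = rng.normal(0.0, 1.0, size=n)      # watermark roughly 60 dB below s in power

gamma_alone = sign_agreement(w, w)        # watermark alone: full agreement
gamma_buried = sign_agreement(s + w, w)   # deeply buried watermark: near chance
```

When the watermark is this far below the host signal, the sign of almost every sample is decided by the host, so the ratio sits close to 0.5 and the SIGN output carries essentially no watermark information.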

5.  CONCLUSION 

An imperceptible tonal watermark can be embedded in speech
asynchronously and detected using unique combinations of bit
manipulation enhancement along with a data-directed correlation. This
watermarking method meets the desired criteria: transparent to
listeners, minimal burden at insertion, no significant change in the
speech communication power, and low complexity recovery. It is ideal
for implementation in simple hardware. Under certain circumstances,
REM produced better performance than the other methods; however, in
the speech experiments performed, REM did not exceed the SIGN method.